Goal: Load Twitter API data into a data management system and write queries to retrieve it.
| user_id | screen_name | followers_count | statuses_count | friends_count | account_created_at | verified |
|---|---|---|---|---|---|---|
| x44196397 | elonmusk | 95589999 | 18134 | 114 | 2009-06-02 | TRUE |
Apache Spark is an open-source engine for large-scale
parallel data processing, known for its speed, ease of use, and
cutting-edge analytics. It provides high-level APIs in general-purpose
programming languages such as Scala, Python, and R, as well as an
optimized engine that supports general data analysis workloads.
Azure Databricks is an analytics platform built on
Microsoft Azure cloud services that offers the latest versions of Apache
Spark and popular open-source libraries. Built around Spark,
Databricks provides a cloud-based interactive workspace with fully
managed Spark clusters. This allows users to work in a single,
easy-to-use environment, create and configure clusters in seconds, and
quickly execute Spark code.
Twitter (Elon Musk 2015-2022):
A dataset of Elon Musk’s most recent Tweets during 2015-2022, stored in
RDS format, where each tweet is its own separate row object. All
Tweets were collected, parsed, and plotted using rtweet in
R. In total, the dataset contains several thousand tweets,
including retweets and replies. All objects go into a single
database.
| # | Variable | # | Variable | # | Variable | # | Variable |
|---|---|---|---|---|---|---|---|
| 1 | status_id | 14 | hashtags | 27 | quoted_followers_count | 40 | retweet_location |
| 2 | created_at | 15 | symbols | 28 | quoted_location | 41 | retweet_description |
| 3 | user_id | 16 | media_expanded_url | 29 | quoted_description | 42 | retweet_verified |
| 4 | screen_name | 17 | media_type | 30 | quoted_verified | 43 | name |
| 5 | text | 18 | mentions_screen_name | 31 | retweet_status_id | 44 | location |
| 6 | source | 19 | quoted_status_id | 32 | retweet_text | 45 | description |
| 7 | reply_to_screen_name | 20 | quoted_text | 33 | retweet_created_at | 46 | followers_count |
| 8 | is_quote | 21 | quoted_created_at | 34 | retweet_source | 47 | friends_count |
| 9 | is_retweet | 22 | quoted_source | 35 | retweet_favorite_count | 48 | statuses_count |
| 10 | favorite_count | 23 | quoted_favorite_count | 36 | retweet_retweet_count | 49 | account_created_at |
| 11 | retweet_count | 24 | quoted_retweet_count | 37 | retweet_user_id | 50 | verified |
| 12 | quote_count | 25 | quoted_user_id | 38 | retweet_screen_name | | |
| 13 | reply_count | 26 | quoted_screen_name | 39 | retweet_followers_count | | |
After setting up the Twitter application and loading the
rtweet package in R, we use the following command to create
a token with application authentication, allowing access to Twitter
data.
library(rtweet)  # load the rtweet package
twitter_token <- create_token(
  app = "mytwitterapp",
  consumer_key = "api_key", consumer_secret = "api_secret",
  access_token = "access_token", access_secret = "access_secret")
# the token is cached and used automatically by subsequent rtweet calls
Now, we search Twitter’s full-archive API, running the
search_fullarchive command to access a user’s historical
tweets. The example below captures Elon Musk’s tweets from January 01,
2010, to May 28, 2022.
df <- search_fullarchive(q = "from:elonmusk", n = 10000,
                         env_name = environment_name,
                         fromDate = "201001010000", toDate = "202205280000")
| created_at | screen_name | text | favorite_count | retweet_count | quote_count | reply_count | is_quote | is_retweet |
|---|---|---|---|---|---|---|---|---|
| 2022-06-20 23:52:18 | elonmusk | @Degentraland Artificial Insemination? | 6963 | 270 | 66 | 830 | FALSE | FALSE |
| 2022-06-20 23:43:38 | elonmusk | AI gets better every day https://t.co/Lz5XfXRJjh | 90161 | 9333 | 1075 | 8200 | FALSE | FALSE |
| 2022-06-20 21:34:57 | elonmusk | Some great suggestions in the comments! | 22970 | 1078 | 101 | 4214 | FALSE | FALSE |
(path finding) Display the thread (replies) of tweets (the tweet, time, id, in-reply-to id, user name with their screen name) posted by Elon Musk with screen_name, in the order in which they were posted.
(location) From which location have the tweets been most actively posted (most number of tweets)?
(hashtags) Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?
(topics) What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets? Products include Falcon 9, Starlink Satellites, Model 3 cars, etc.
(trending) Are there any trends in what Musk tweets about the company?
(nature of engagement) What is the percentage of different types of tweets (simple tweet, reply, retweet, quoted tweet) relative to the overall number of tweets?
Building a Databricks workspace using an Apache Spark cluster.
First, we create a Databricks workspace from the Azure portal and then launch the workspace, which redirects us to the interactive Databricks portal. We create a Spark cluster from the Databricks interactive workspace and configure a notebook on the cluster. In the notebook, we can use either PySpark or SparkR to read data from a dataset into a Spark DataFrame. Using the Spark DataFrame, we can run a Spark SQL job to query the data.
R Code:
require(SparkR)
df <- read.df("dbfs:/FileStore/dfclean.csv", source = "csv",
header = "true", inferSchema = "true")
createOrReplaceTempView(df, "twitterTemp") # register table for SQL
Azure Databricks stores this data file in the FileStore/ folder of the Databricks File System (DBFS); the SparkR command above reads it into a Spark DataFrame.
Users can perform relational procedures on DataFrames using a
domain-specific language (DSL) similar to R dataframes and Python
Pandas. DataFrames support standard relational operators, including
projection (select), filter (where),
join, and aggregations (group by).
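A minimal sketch of these operators in SparkR, assuming an active Spark session and the df DataFrame loaded above (this needs a running cluster, so it is shown for illustration only):

```r
require(SparkR)

# filter (where): keep only retweets
rt <- where(df, df$is_retweet == TRUE)

# projection (select): keep just the columns we need
rt <- select(rt, "retweet_screen_name", "retweet_created_at")

# aggregation (group by): count retweets per original author
perUser <- count(groupBy(rt, "retweet_screen_name"))
head(arrange(perUser, desc(perUser$count)))
```

Each call returns a new DataFrame, so the operators compose in any order, much like clauses of a SQL query.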
To track Elon Musk’s engagement with people on Twitter, we want to look into tweets containing conversations with and directed to other users. We begin by unpacking information for each of Elon Musk’s tweets that mention another person’s username. Specifically, mentions are a type of tweet containing other account usernames, preceded by the “@” symbol.
R Code:
dfMentions <- SparkR::select(df, "created_at", "mentions_user_id",
"mentions_screen_name", "text")
# comparing a column to NA always yields NA; use isNotNull() instead
dfMentions <- SparkR::filter(dfMentions, isNotNull(dfMentions$mentions_user_id))
createOrReplaceTempView(dfMentions, "dfMentions")
| created_at | mentions_user_id | mentions_screen_name | text |
|---|---|---|---|
| 2022-06-17 16:51:17 | x1016059981907386368 x15854702 x10850192 x912969880852299776 | teslaownersSV UAW GM klwtts | @teslaownersSV @UAW @GM @klwtts Yup |
| 2022-06-17 16:17:14 | x34743251 | SpaceX | Liftoff! https://t.co/28eNKniMqe |
| 2022-06-18 02:56:18 | x87101095 x13298072 x34743251 x14171401 x1263491240336769026 x998140573008695296 x1957568797 x985686123123949568 x1285872552766734346 | blueskykites Tesla SpaceX mayemusk WholeMarsBlog 28delayslater JohnnaCrider1 Kristennetten SirineAti | @blueskykites @Tesla @SpaceX @mayemusk @WholeMarsBlog @28delayslater @JohnnaCrider1 @Kristennetten @SirineAti @GailAlfarATX @DimaZeniuk @bevedoni @RationalEtienne @ashleevance @adamhoov @klwtts @RenataKonkoly For a couple of months, but, yeah, that looks like the place. Does Mark still live there? |
| 2022-06-20 17:54:19 | x1016059981907386368 x34743251 | teslaownersSV SpaceX | @teslaownersSV @SpaceX Super talented team at SpaceX |
As shown above, there exist tweets containing multiple mentioned usernames within the body of the text, all grouped together in a single row. So now we must manipulate the data so that each mentioned user for a tweet forms its own row, which allows us to count the total number of times Musk mentioned a unique user.
dfMentions %>%
  dplyr::mutate(user = str_extract_all(text, "@\\w+")) %>%  # one list of mentions per tweet
  tidyr::unnest_longer(user) %>%  # one mentioned user per row
  dplyr::count(user, sort = TRUE)
The above command uses the str_extract_all() function to
extract the mentioned users for each tweet and
unnest_longer() to transform the nested lists into tidy
rows so that each row contains only one user. Lastly, we count the total
number of observations for each unique user.
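The same extract-unnest-count pattern can be sketched in base R with a hypothetical vector of tweet texts, which makes the mechanics easy to verify:

```r
# hypothetical tweet texts standing in for the real dataset
tweets <- c("@SpaceX Liftoff!",
            "@Tesla @SpaceX great work",
            "no mentions here")

# extract every @mention from each tweet (one character vector per tweet)
mentions <- regmatches(tweets, gregexpr("@\\w+", tweets))

# flatten the nested vectors and count occurrences per unique user
counts <- sort(table(unlist(mentions)), decreasing = TRUE)
counts
```

Here @SpaceX appears twice and @Tesla once, so the sorted table puts @SpaceX first, mirroring the tidyverse pipeline above.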
SQL Query:
SELECT mentions_screen_name, COUNT(*) AS n
FROM dfMentions
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
ORDER BY n DESC;
SpaceX and Tesla are the most frequently mentioned users in Musk’s tweets. Linking conversations together, a reply is a type of tweet sent in direct response to another user’s tweet. Like mentions, replies let users direct tweets toward other Twitter users and take part in conversations. Following the same general procedure as above, we obtain the following results.
The different types of tweets that exist are general tweets, mentions, replies, retweets, and quotes. The mention and reply types were defined in the previous section. General tweets are original Twitter posts containing text, photos, a GIF, and/or video, but no mentions, replies, retweets, or quotes. Lastly, retweets and quotes are both re-postings of another person’s tweet, although quotes allow users to post another person’s tweet with their own added comment.
R Code:
nGeneral <- nrow(df[df$is_retweet == FALSE, ] %>% # removes retweets
subset(is.na(reply_to_status_id))) # removes replies
nMentions <- nrow(subset(df, !is.na(df$mentions_user_id)))
nReplies <- nrow(subset(df, !is.na(df$reply_to_status_id)))
nRetweets <- nrow(df[df$is_retweet == TRUE, ])
nQuotes <- nrow(df[df$is_quote == TRUE, ])
tweetTypes <- data.frame(
type = c("General", "Mentions", "Replies", "Retweets", "Quotes"),
count = c(nGeneral, nMentions, nReplies, nRetweets, nQuotes))
In the above, we compute five counts, one for each tweet type: general tweets, mentions, replies, retweets, and quotes. We then store them in a separate data frame containing each tweet type and its respective count.
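Turning these counts into the percentages asked for in the engagement question is then a one-liner; a base-R sketch with hypothetical counts (the categories overlap, e.g. a reply can also contain mentions, so the counts need not sum to the total):

```r
nTotal <- 1000  # hypothetical total number of tweets
tweetTypes <- data.frame(
  type  = c("General", "Mentions", "Replies", "Retweets", "Quotes"),
  count = c(400, 250, 200, 100, 50))  # hypothetical per-type counts

# percentage of each tweet type relative to the overall number of tweets
tweetTypes$percent <- round(100 * tweetTypes$count / nTotal, 1)
tweetTypes
```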
Now, let’s display a table holding information for each of Elon Musk’s retweets and query the data to obtain the usernames of Musk’s most frequently retweeted users. To identify them, we use dplyr tools to filter, select, and tally each user from Musk’s retweets.
dfRetweet <- df %>% dplyr::filter(is_retweet == TRUE) %>%
dplyr::select(
retweet_created_at, retweet_text, retweet_user_id,
retweet_screen_name, retweet_favorite_count,
retweet_retweet_count, retweet_location) %>%
dplyr::arrange(desc(retweet_created_at))
dfRetweet %>% group_by(retweet_screen_name) %>%
tally(sort = TRUE)
| created_at | text | user_id | screen_name | favorite_count | retweet_count | location |
|---|---|---|---|---|---|---|
| 2022-04-27 23:41:36 | Docking confirmed! https://t.co/RyPZBAv5Lo | x34743251 | SpaceX | 128109 | 6890 | Hawthorne, CA |
| 2020-05-14 23:22:56 | Breakthrough, Part Deux @LVCVA https://t.co/JQ7sjPXSfZ | x859816394556284929 | boringcompany | 7278 | 649 | Austin/Las Vegas |
| 2016-12-05 06:33:41 | We’re releasing Universe, a platform for measuring and training AI agents: https://t.co/bx7OjMDaJK | x4398626122 | OpenAI | 3088 | 1759 | NA |
| 2016-03-31 20:10:58 | Longest line EVER #Model3 https://t.co/EafMhK3BDC | x13298072 | Tesla | 5487 | 3059 | NA |
The following parses the created_at datetime column to
derive the year, month, and weekday for each tweet.
dfTime <- select(df, c("status_id", "created_at"))
dfTime$created_at <- to_timestamp(dfTime$created_at)
dfTime$year <- year(dfTime$created_at)
dfTime$month <- date_format(to_date(dfTime$created_at), "MMMM")
dfTime$weekday <- date_format(to_date(dfTime$created_at), "EEEE")
createOrReplaceTempView(dfTime, "dfTime")
The resulting schema for the above DataFrame consists of a string, timestamp, integer, string, and string for the status_id, created_at, year, month, and weekday columns, respectively. The sample below also includes a time-of-day label derived from the hour of created_at.
| created_at | year | month | weekday | time |
|---|---|---|---|---|
| 2022-06-03 00:59:42 | 2022 | June | Friday | night |
| 2021-10-14 08:34:18 | 2021 | October | Thursday | morning |
| 2021-06-04 01:54:58 | 2021 | June | Friday | night |
| 2021-01-25 05:23:24 | 2021 | January | Monday | morning |
| 2020-01-17 01:56:37 | 2020 | January | Friday | night |
| 2019-09-19 14:45:13 | 2019 | September | Thursday | afternoon |
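The time-of-day label can be derived in base R by bucketing the hour; a minimal sketch, where the exact hour cutoffs are assumptions chosen to match the rows shown above:

```r
# hypothetical hour cutoffs: before 5am = night, 5-11 = morning,
# 12-17 = afternoon, 18 onward = night again
hour_to_period <- function(h) {
  ifelse(h < 5, "night",
         ifelse(h < 12, "morning",
                ifelse(h < 18, "afternoon", "night")))
}

created <- as.POSIXct("2021-10-14 08:34:18", tz = "UTC")
hour_to_period(as.integer(format(created, "%H", tz = "UTC")))  # "morning"
```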
SQL Query:
SELECT weekday AS created_weekday, COUNT(*) AS n
FROM dfTime
GROUP BY created_weekday
ORDER BY n DESC;
Figuring out the most common words in Elon Musk’s tweets involves text mining. The first step is to clean the text in our dataset by converting it to lowercase and removing punctuation, usernames, links, etc. We then use tidy R tools to convert the text to a tidy format and remove stop words.
R Code:
library(tidytext)
library(stringr)
dfWords <- df %>%
  dplyr::mutate(text = str_remove_all(text, "&amp;|&lt;|&gt;"),
text = str_remove_all(text, "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"),
text = str_remove_all(text, "[^\x01-\x7F]")) %>%
unnest_tokens(word, text, token = "tweets") %>%
dplyr::filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]"),
!str_detect(word, "^#"),
!str_detect(word, "@\\S+")) %>%
dplyr::count(word, sort = TRUE)
In the above command, the pattern-matching function
str_remove_all() removes unwanted text, and the
unnest_tokens() function splits the text of each tweet into
tokens, using a one-word-per-row format. We then filter out stop
words and use the str_detect() function to drop hashtags,
@mentions, and tokens that contain no letters.
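A condensed base-R version of that cleaning pipeline, applied to a single tweet, makes each filtering step visible (the #AI and @OpenAI suffix added here is hypothetical):

```r
tweet <- "AI gets better every day https://t.co/Lz5XfXRJjh #AI @OpenAI"
clean <- gsub("https?://\\S+", "", tweet)          # strip links
words <- unlist(strsplit(tolower(clean), "\\s+"))  # lowercase, one word per element
words <- words[nzchar(words)]                      # drop empty tokens
words <- words[!grepl("^#|^@", words)]             # drop hashtags and mentions
words  # "ai" "gets" "better" "every" "day"
```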
Here we use the syuzhet R package to iterate over a
vector of strings containing the text of all of Elon Musk’s tweets
in our dataset. To obtain this vector, the plain_tweets()
function from the rtweet package converts the tweets
character vector into cleaned-up, plain text. We then pass the
vector to the get_sentiment() function, which returns
sentiment values based on the syuzhet sentiment dictionary, developed
from a collection of human-coded sentences.
R Code:
round_time <- function(x, secs)  # round timestamps to the nearest interval
  as.POSIXct(hms::round_hms(x, secs))
sent_scores <- function(x)  # score each tweet's text, shifted down by 0.5
  syuzhet::get_sentiment(plain_tweets(x)) - .5
df.sentiment <- df %>%
  dplyr::mutate(days = round_time(created_at, 60 * 60 * 24),
                sentiment = sent_scores(text)) %>%
  dplyr::group_by(days) %>%
  dplyr::summarise(sentiment = sum(sentiment, na.rm = TRUE))
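The group-and-sum step can be illustrated without syuzhet; a base-R sketch with hypothetical per-tweet scores shows how the daily totals come together:

```r
# hypothetical timestamps and centered sentiment scores for three tweets
created_at <- as.POSIXct(c("2022-06-01 10:00:00", "2022-06-01 18:00:00",
                           "2022-06-02 09:00:00"), tz = "UTC")
sentiment <- c(1, -0.5, 0.25)

days <- as.Date(created_at)            # round each timestamp down to its day
daily <- tapply(sentiment, days, sum)  # one summed score per day
daily
```

The two June 1 tweets collapse into a single score of 0.5, just as group_by(days) plus summarise(sum(...)) does in the pipeline above.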
Extending the above sentiment analysis, the next step is to
understand the opinion or emotion in the text. First, we must clean the
text from our dataset so that it is in a tidy format. We accomplish this
using the R function gsub() to remove unwanted text and
the get_nrc_sentiment() function to get the emotions and
valences from the NRC sentiment dictionary for each of
Musk’s tweets.
R Code:
# patterns for unwanted text: retweet markers, links, HTML tags, usernames,
# punctuation, line breaks, digits, and extra whitespace
txt <- c("rt|RT", "http\\w+", "<.*?>", "@\\w+", "[[:punct:]]",
         "\r?\n|\r", "[[:digit:]]", "[ |\t]{2,}", "^ ", " $")
cleanTweet <- as.vector(df$text)
# gsub() takes a single pattern, so apply each pattern in turn
for (pattern in txt) cleanTweet <- gsub(pattern, "", cleanTweet)
textSentiment <- syuzhet::get_nrc_sentiment(cleanTweet)
nrc_sentiment <- cbind(df, textSentiment) %>%
  dplyr::select(created_at, anger, anticipation, disgust, fear,
                joy, sadness, surprise, trust, negative, positive)
In the above command, the gsub() function replaces all occurrences of
the given patterns, and the get_nrc_sentiment() function calculates the
presence of eight different emotions and their corresponding valence.
The resulting columns include the eight emotions anger,
anticipation, disgust, fear,
joy, sadness, surprise, and
trust, plus their respective
positive or negative valence.
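Applied to a single hypothetical retweet, the pattern loop strips each class of unwanted text in turn (the patterns here are an illustrative subset of the full list above):

```r
patterns <- c("rt|RT", "https?://\\S+", "@\\w+", "[[:punct:]]")
x <- "RT @SpaceX: Liftoff! https://t.co/abc"  # hypothetical input
for (p in patterns) x <- gsub(p, "", x)       # apply each pattern in turn
cleaned <- trimws(gsub(" {2,}", " ", x))      # collapse leftover whitespace
cleaned  # "Liftoff"
```

Note the order matters: links must be removed before punctuation, or the stripped URL fragments would survive as ordinary words.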